This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Fix WandB requeueing #4175

Merged: 4 commits, Nov 19, 2021

Conversation

@c-flaherty (Contributor) commented Nov 16, 2021

Patch description

If a job is pre-empted and then re-queued, the current logic produces multiple runs in WandB, each corresponding to a separate execution of the job. We should tell WandB to automatically merge executions of the same job into a single run when the previous process exited unsuccessfully (i.e., was pre-empted).

For more detail, see: https://docs.wandb.ai/guides/track/advanced/resuming.
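As a rough illustration of the idea (not the actual diff in this PR), the sketch below builds the keyword arguments for `wandb.init` with `resume=True`, assuming the `resume` argument behaves as described on the page linked above. The helper names `build_wandb_init_kwargs` and `start_run` are hypothetical and exist only for this example.

```python
from typing import Any, Dict, Optional


def build_wandb_init_kwargs(
    name: str,
    project: str,
    entity: Optional[str] = None,
    resume: bool = True,
) -> Dict[str, Any]:
    """Build keyword arguments for wandb.init (hypothetical helper).

    Assumption: passing resume=True asks WandB to continue the previous
    run if the last process exited unsuccessfully, instead of creating
    a new run for every execution of the job.
    """
    kwargs: Dict[str, Any] = {
        "name": name,
        "project": project,
        "resume": resume,
    }
    if entity is not None:
        kwargs["entity"] = entity
    return kwargs


def start_run(**kwargs: Any) -> Any:
    """Start (or resume) a WandB run with the given arguments."""
    # Imported lazily so the kwargs helper works without wandb installed.
    import wandb

    return wandb.init(**kwargs)
```

Calling `start_run(**build_wandb_init_kwargs("a_test", "test2", "colinflaherty"))` in a re-queued job would then be expected to attach to the earlier run rather than open a second one, provided that assumption about `resume` holds.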

Testing steps

I ran the following command twice, pressing Ctrl+C during the first execution, and used my WandB account to verify that it produced a single run containing the logs from both executions rather than two separate runs. To run this yourself, log in to your WandB account and change the --wandb-entity argument below to your username.

parlai train_model -t personachat -m transformer/ranker --n-layers 1 --embedding-size 300 --ffn-size 600 --n-heads 4 --num-epochs 10 -veps 0.25 -bs 64 -lr 0.001 --dropout 0.1 --embedding-type fasttext_cc --candidates batch -wblog True --wandb-name a_test --wandb-project test2 --wandb-entity colinflaherty --model-file './tmp/model'

This is what things look like with the code currently on main:
Screen Shot 2021-11-16 at 3 49 10 PM

Screen Shot 2021-11-16 at 3 49 35 PM

Screen Shot 2021-11-16 at 3 49 26 PM

This is what things look like with this PR's changes added:

Screen Shot 2021-11-16 at 3 52 55 PM

Screen Shot 2021-11-16 at 3 53 16 PM

@klshuster (Contributor) left a comment

thanks for the fix!

@c-flaherty (Contributor Author) commented Nov 18, 2021

@klshuster Are you familiar with what might be causing the failing tests? I looked through the logs and didn't see anything suggesting they were related to my code changes. At the same time, they're not failing on main. Could they be flaky?

https://app.circleci.com/pipelines/github/facebookresearch/ParlAI/10365/workflows/3c44fc41-cc37-48ec-ad4b-899479cee6b5/jobs/85523

https://app.circleci.com/pipelines/github/facebookresearch/ParlAI/10365/workflows/3c44fc41-cc37-48ec-ad4b-899479cee6b5/jobs/85518

@stephenroller (Contributor) commented

Test looks unrelated. Go ahead and merge! Great fix!

@c-flaherty (Contributor Author) commented

> Test looks unrelated. Go ahead and merge! Great fix!

@stephenroller Thanks and sounds good! Since I don't have write access for the ParlAI repo, I can't merge. Can you merge it?

@klshuster klshuster merged commit 4b1d07d into facebookresearch:main Nov 19, 2021